REMARKS

This response is filed in reply to the Office Action mailed August 18, 2003. Claims 1-4, 8 
and 9 have been canceled by the above amendment and new claims 24-50 have been added. 
Support for the new claims can be found throughout the specification. Specifically, support for 
claims 24-30 can be found at page 5, lines 1-29. Support for claims 31-33 can be found at page 12, 
lines 29-31, bridging to page 13, lines 1-18. Support for claims 36-38 can be found at page 14, 
lines 21-25. Support for new claims 39-50 can be found in Example 10 (see page 30, lines 1-19).

Applicants note that non-elected claims 14-23 have been withdrawn from consideration. 
Applicants will request that withdrawn claims 17-23 be rejoined pursuant to MPEP §821.04 once 
the elected claims are deemed allowable. 

No new matter has been added. Claims 24-50 are pending and at issue. Applicants request 
reconsideration of the pending claims.



INTERVIEW SUMMARY 

A telephonic interview between the Examiner and Applicants' representative took place on 
July 10, 2003. Applicants' representative requested that a substitute office action be provided 
because the Gonzalez reference had not been included in the original office action. Applicants' 
representative further requested that the time for responding to the office action be reset.

OBJECTION 

Claim 8 is objected to as being dependent upon a non-elected base claim. This objection 
is moot in view of the cancellation of claim 8. 

I. REJECTION UNDER 35 U.S.C. §101

Claims 1-4, 8 and 9 stand rejected under 35 U.S.C. §101 as allegedly directed to 
non-statutory subject matter. This rejection is moot in view of the cancellation of claims 1-4, 8 and 9. 
Applicants note that the new claims now recite an "isolated" polypeptide, clearly distinguishing 
the claimed compositions from products of nature.

II. REJECTIONS UNDER 35 U.S.C. §112, FIRST PARAGRAPH

Written Description 

Claims 1-3 and 8 stand rejected under 35 U.S.C. §112, first paragraph, as allegedly 
containing subject matter not described in the specification in such a way as to reasonably convey 
to one skilled in the relevant art that the inventor, at the time the application was filed, had 
possession of the claimed invention. This rejection is moot in view of the cancellation of 
claims 1-3 and 8. Applicants traverse this rejection as it may apply to the new claims.

The Office Action states that the claims are drawn to a "genus of (R)-2,3-dehydrogenase, 
with any structure and from any source." Applicants submit that one skilled in the art would not 
expect the genus of proteins encompassed by the pending claims to have substantial variation. The 
new claims encompass a narrow range of variants including those (a) having up to 50 conservative 
amino acid substitutions, (b) hybridizing under high stringency conditions with the nucleic acid of 
SEQ ID NO: 1, or (c) having at least 70% identity to the amino acid sequence of SEQ ID NO: 2. 
In addition, many of the pending claims are substantially narrower than those. As there is little 
variation within the claimed genus, a single example is sufficient to demonstrate possession of the 
claimed genus of polypeptides. 

For example, claim 24 recites an isolated polypeptide comprising a sequence with "70%" 
sequence identity to the polypeptide of SEQ ID NO:2. The polypeptides encompassed by claim 24 
are further limited to butanediol dehydrogenases possessing specific functional characteristics 
recited in parts (a), (b) and (c) of claim 24. Any such proteins would necessarily be significantly 
similar in terms of structure. Similar arguments are applicable to the polypeptides claimed in claim 
26. Dependent claims 25, 27 and 31-33 narrow the scope even more.

Claim 28 recites specific chemical and physical properties such as enzymatic activity and 
molecular weight. The characteristics recited in claim 28 clearly limit the claimed genus of 
polypeptides to those having very similar structure and function, i.e., they are limited to a 
butanediol dehydrogenase with the specific characteristics set forth in the claim. The "Guidelines 
for Examination of Patent Applications Under the 35 USC §112, Written Description 
Requirement" (hereinafter "Guidelines") support claiming in this manner. The Guidelines 
indicate that patent examiners are required to determine "what each claim, as a whole, covers" (66 
Fed. Reg. 1099, at 1105). The Guidelines further indicate that the disclosure of any combination of 
identifying characteristics that "distinguish the claimed invention from other materials and would 
lead one of skill in the art to the conclusion that the applicant was in possession of the claimed 
species" is sufficient to comply with the written description requirement (Id. at 1106).

Claims 39, 42, 45 and 48 recite individual fragments, or a combination of fragments, of the 
polypeptide set forth in SEQ ID NO:2. For example, claim 39 recites a polypeptide that is a 
(R)-2,3-butanediol dehydrogenase and comprises the amino acid sequence of SEQ ID NO:5 
(Phe-His-Ala-Ala-Phe-Asp). Claim 39 further recites functional characteristics that the claimed 
dehydrogenase must possess. Applicants submit that the fragments disclosed in the originally filed 
specification provide "blaze marks" (In re Ruschig, 379 F.2d 990, 994 (CCPA 1967)) directing the 
skilled artisan to polypeptides possessing the required activities of the claimed polypeptides. A 
person of skill in the art would not expect substantial variation among species encompassed within 
the scope of the claims. Similar arguments can be made for claim 42 (SEQ ID NO:4; 
Ala-Thr-Ser-His-Cys-Ser-Asp-Arg-Ser-[illegible]), claim 45 (SEQ ID NO:3; 
Lys-Pro-Gly-Asp-Arg-Val-Ala-Val-Glu-Ala) and claim 48 (SEQ ID NOs:3, 4 and 5).

In view of the limitations recited in the pending claims, the skilled artisan would recognize 
that the inventor was in possession of the claimed polypeptides as of the filing date of the 
application. 

Enablement 

Claims 1-3 and 8 also stand rejected under §112, first paragraph, as allegedly containing 
subject matter not described in the specification in such a way as to enable one of skill in the art 
to make or use the invention. This rejection is moot in view of the cancellation of claims 1-3 and 
8. Applicants traverse this rejection as it may apply to the new claims. 

The Office Action alleges that the specification does not reasonably provide enablement 
for an (R)-2,3-butanediol dehydrogenase different from SEQ ID NO:2. Applicants respectfully 
disagree. As noted above, new claim 24 claims a polypeptide having at least 70% sequence 
identity to the amino acid sequence set forth in SEQ ID NO:2. In addition, the claimed 
polypeptide is limited to a "butanediol dehydrogenase" having the functional characteristics 
recited in parts (a), (b) and (c) of claim 24. With regard to the functional limitations set forth in 
parts (a) and (b), the specification provides an assay for identifying oxidizing 
(2R,3R)-2,3-butanediol activity recited in part (a) (see page 8, lines 6-12) and an assay for identifying activity 
of oxidizing glycerol recited in part (b) (page 8, lines 13-18). The specific activity of the enzyme 
recited in part (c) can be calculated from the previously described assays. Producing a 
polypeptide that comprises a sequence at least 70% identical to SEQ ID NO:2 is a trivial task 
that any molecular biologist could accomplish using standard recombinant DNA techniques. 
Applicants have taught how to assay for the indicated activities. Any polypeptide that possesses 
the activity specified in the claim is presumptively useful as a (R)-2,3-butanediol dehydrogenase. 
Thus, the specification teaches both how to make and how to use the claimed polypeptides. The 
enablement requirement is clearly met.

Similar arguments can be made for pending claims 26, 28, 39, 42, 45 and 48. One skilled 
in the art could make the claimed polypeptides from the disclosures in the patent coupled with 
information known in the art without undue experimentation. By providing the appropriate 
nucleic acid and amino acid sequences, along with additional information regarding sequence 
comparison programs, hybridization conditions, and enzymatic assays, Applicants have 
presented the skilled artisan with all the information necessary to make the claimed compositions 
using only routine experimentation. Once made, the polypeptides can be used as described in the 
specification. 

On page 6, the Office Action asserts that the specification fails to establish the specific 
structures responsible for the butanediol dehydrogenase activity of the claimed polypeptide or 
what changes in the structures could be made without abolishing this activity (see page 6, parts 
(A) through (D)). The Office Action concludes that undue experimentation would be required to 
enable the full scope of the pending claims because neither the specification nor the art provides 
any guidance as to which of the polypeptides encompassed by the claims will retain the activity 
of the polypeptide set forth in SEQ ID NO:2 (i.e., butanediol dehydrogenase activity). 

Applicants first note that, because the claim requires that the polypeptide possess 
butanediol dehydrogenase activity, every polypeptide encompassed by the claims necessarily 
retains that activity. Thus, encompassing inactive proteins is simply not an issue. 

Second, Applicants agree that it is possible, at least in some cases, to abolish activity of a 
given protein by mutating a critical residue. However, Applicants disagree that this fact means 
that one of ordinary skill cannot make functional analogs of the claimed polypeptide without 
undue experimentation. In support of this, Applicants refer to Bowie et al. (Science 247:1306; 
copy enclosed as Appendix A) which teaches, at page 1306, col.2, lines 12-13, that "proteins are 
surprisingly tolerant of amino acid substitutions." Bowie et al. cites as evidence a study carried 
out on the lac repressor. Of approximately 1500 single amino acid substitutions at 142 positions 
in this protein, about one-half of the substitutions were found to be "phenotypically silent": that 
is, had no noticeable effect on the activity of the protein (page 1306, col. 2, lines 14-17). 
Presumably the other half of the substitutions exhibited effects ranging from slight to complete 
abolishment of repressor activity. Thus, one can expect, based on Bowie et al.'s teachings, to 
find over half (and possibly well over half) of random substitutions in any given protein to result 
in mutated proteins with full or nearly full activity. These are far better odds than those at issue 
in In re Wands , 858 F.2d 731 (Fed. Cir. 1988), in which the court said that screening many 
hybridomas to find the few that fell within the claims was not undue experimentation. The 
question is not whether it is possible to abolish activity with a single amino acid change, but 
rather whether one of ordinary skill can produce, without undue experimentation, mutants in 
which the activity is not abolished. Based on Bowie et al.'s teachings, one would predict that 
even random substitution of residues in the claimed polypeptide will result in a majority of the 
mutants' having full or partial butanediol dehydrogenase activity. 

Furthermore, the specification amply teaches how to make and test mutants to find those 
with the dehydrogenase activity required by the claims. For example, there is little doubt that the 
provision of an amino acid sequence (e.g., SEQ ID NO:2) necessarily provides a person of 
average skill in the art with enough information to enable him or her to make a sequence 
containing, for example, up to 50, 30, or 10 conservative amino acid substitutions. Further, 
generation of such mutants is a routine task. Similarly, there is little doubt that the provision of 
the amino acid sequence of a fragment (e.g., SEQ ID NO:3, 4 and 5) necessarily provides a 
person of average skill in the art with enough information to enable him or her to make a 
butanediol dehydrogenase containing that sequence by simply following the teachings of the 
specification. Plainly, these experiments would not be "undue." 

In view of the limitations recited in the pending claims, the skilled artisan could make the 
claimed polypeptides from the disclosures in the patent coupled with information known in the 
art without undue experimentation. 

III. REJECTIONS UNDER 35 U.S.C. §112, SECOND PARAGRAPH

Claims 1-3 and 8 stand rejected under 35 U.S.C. §112, second paragraph, as allegedly 
indefinite for failing to particularly point out and distinctly claim the subject matter which the 
Applicants regard as the invention. This rejection is moot in view of the cancellation of claims 
1-3 and 8. Applicants note that the new claims do not employ the language that the office action 
describes as confusing. 

IV. REJECTION UNDER 35 U.S.C. §102 

§102(a) 

Claims 1 and 8 stand rejected under 35 U.S.C. § 102(a) as allegedly anticipated by 
Gonzalez et al. Applicants traverse this rejection to the extent it may be applied to the presently 
pending claims, but also submit herewith a certified copy of a translation of the Japanese patent 
application to which the pending application claims priority. The filing date of the priority 
document is October 31, 2000. The publication date of the cited reference of Gonzalez is 
November 17, 2000. Accordingly, the cited reference cannot anticipate the pending claims.



Applicant : Hiroaki Yamamoto et al. Attorney's Docket No.: 14875-092001 / D1-A0009-US 

Serial No. : 10/020,674 

Filed : October 30, 2001 

Page : 18 of 20 

§102(b) 

Claims 1 and 8 stand rejected under 35 U.S.C. § 102(b) as allegedly anticipated by 
Heidlas et al. This rejection is moot in view of the cancellation of claims 1 and 8. Applicants 
traverse this rejection as it may be applied to the new claims. 

The dehydrogenase disclosed in Heidlas was purified from Saccharomyces cerevisiae. 
While Heidlas fails to provide any nucleic acid or amino acid sequence information associated 
with their purified dehydrogenase, the cited reference of Gonzalez does disclose the sequence of 
a dehydrogenase derived from Saccharomyces cerevisiae. Applicants contend that the 
dehydrogenase of Gonzalez is the same as the dehydrogenase disclosed in Heidlas. Appendix B, 
part 1, which accompanies the present response, provides a comparison of the amino acid 
sequence set forth in SEQ ID NO:2 (derived from the genus Pichia) with the amino acid 
sequence disclosed in Gonzalez (derived from the genus Saccharomyces). The sequence 
comparison clearly indicates that the two sequences share only about 48% homology. In 
contrast, the subject matter of claim 24 is limited to those polypeptides having "70%" sequence 
homology to SEQ ID NO:2. Claims 34 and 35 claim polypeptides that "consist of" or 
"comprise" the amino acid sequence set forth in SEQ ID NO:2. The subject matter of claims 
36-38 is limited to polypeptides that differ from SEQ ID NO:2 by, at most, 13% (i.e., 50 amino 
acid substitutions). Thus, the claimed polypeptides would still be required to have at least 87% 
sequence homology with SEQ ID NO:2. Accordingly, the subject matter of the previously 
mentioned claims is not anticipated by the cited reference of Heidlas.
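
The percentage figures in this argument can be illustrated with a minimal sketch. It assumes identity is counted over aligned columns in which both sequences have a residue; the comparison program behind Appendix B may use a different convention. The function names and the example sequences and lengths are hypothetical and are not taken from the specification.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns; '-' marks a gap.

    Assumption: gap columns are excluded from the denominator.
    Other conventions (e.g. dividing by the full alignment length) exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_floor(length: int, max_substitutions: int) -> float:
    """Lowest percent identity reachable with at most max_substitutions
    substitutions in a sequence of the given length."""
    if length <= 0:
        raise ValueError("length must be positive")
    changed = min(max_substitutions, length)
    return 100.0 * (length - changed) / length


if __name__ == "__main__":
    # Hypothetical toy alignment: one mismatch in six columns.
    print(round(percent_identity("ACDEFG", "ACDQFG"), 1))
    # Hypothetical 400-residue protein with up to 50 substitutions.
    print(identity_floor(400, 50))
```

A bound of this kind only limits how far a substituted sequence can drift from the reference; it says nothing about whether the variant retains activity.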

New claims 39, 42, 45 and 48 recite SEQ ID NO:5, SEQ ID NO:4, SEQ ID NO:3 or 
SEQ ID NOs:5, 4 and 3, respectively. These amino acid sequences are tryptic peptide 
fragments of SEQ ID NO:2. The sequence comparison provided in Appendix B, part 1, clearly 
indicates that none of the fragments exist in the sequence set forth in the cited reference. Thus, 
the polypeptides of claims 39, 42, 45 and 48 cannot be anticipated by the cited reference.
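
The fragment check described above amounts to an exact substring search. The following is a minimal sketch: the peptide sequences for SEQ ID NO:5 and SEQ ID NO:3 are taken from the text, SEQ ID NO:4 is omitted because it is incomplete in this copy, and the reference sequence is a hypothetical placeholder rather than the Gonzalez sequence.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(peptide: str) -> str:
    """Convert 'Phe-His-Ala' style notation to one-letter codes."""
    return "".join(THREE_TO_ONE[residue] for residue in peptide.split("-"))


def fragments_present(reference: str, fragments: dict) -> dict:
    """Report, for each named fragment, whether it occurs exactly
    (as a contiguous substring) in the reference sequence."""
    return {name: to_one_letter(peptide) in reference
            for name, peptide in fragments.items()}


FRAGMENTS = {
    "SEQ ID NO:5": "Phe-His-Ala-Ala-Phe-Asp",
    "SEQ ID NO:3": "Lys-Pro-Gly-Asp-Arg-Val-Ala-Val-Glu-Ala",
}

if __name__ == "__main__":
    # Hypothetical reference sequence, for illustration only.
    toy_reference = "MSTKFHAAFDLLG"
    print(fragments_present(toy_reference, FRAGMENTS))
```

An exact-match search of this kind is stricter than an alignment: a single differing residue makes a fragment count as absent.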

Claim 26 claims a polypeptide encoded by a polynucleotide that is at least 80% 
identical to a polynucleotide comprising the nucleotide sequence of SEQ ID NO:1. Applicants 
submit that the amino acid sequence data provided in Appendix B, part 1, strongly indicates 
that the cited reference fails to teach a nucleic acid sequence with the limitations recited in 
claim 26, or a polypeptide encoded thereby. Accordingly, the cited reference fails to anticipate 
the subject matter of claim 26.

Claim 28 is limited to a dehydrogenase that ... "has a molecular weight of about 36,000 
Da when determined by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and about 
76,000 Da when determined by gel filtration." At page 269, column 2, under the heading 
"Molecular mass," Heidlas discloses a dehydrogenase having a molecular weight of about 
"140,000" by gel filtration, clearly distinguishing the dehydrogenase of the cited reference 
from the dehydrogenase of claim 28. 

In view of the new claims and in light of the above discussion, Applicants submit that 
the cited reference fails to anticipate each and every element of the claimed invention. 

V. PROVISIONAL OBVIOUSNESS-TYPE DOUBLE PATENTING 

Claims 1 and 8 stand provisionally rejected under the judicially created doctrine of 
obviousness-type double patenting as allegedly unpatentable over claims 1 and 3-4 of co-pending 
Application No. 10/147,003 ('003). Applicants believe that this rejection is moot in view of the 
cancellation of claims 1 and 8. It is further believed that the claims pending in the present 
application are patentably distinct from those of the '003 co-pending application because the 
sequences (both nucleic acid and amino acid) claimed in the present application do not 
encompass the sequences claimed in the '003 application. For example, the homology between 
SEQ ID NO:1 of the present invention and the SEQ ID NO:1 of the '003 application (see page 
16 of the '003 specification) is only about 4% (see Appendix B, part 2). The homology between 
SEQ ID NO:2 of the invention and the SEQ ID NO:2 of '003 at page 16 is also very low (only 
53%) (see Appendix B, part 3). 

In summary, for the reasons set forth herein, Applicants maintain that claims 24-50 
clearly and patentably define the invention. Applicants request that the Examiner reconsider and 
withdraw the various grounds for rejection set forth in the Office Action. 

If the Examiner would like to discuss any of the issues raised in the Office Action, 
Applicants' representative can be reached at (617) 542-5070. Enclosed is a $1262 check for 
excess claim fees and a one month extension fee. Please apply any other charges or credits to 
deposit account 06-1050. 



Respectfully submitted, 

Date: [illegible] 

Janis K. Fraser, Ph.D., J.D. 
Reg. No. 34,819 

Fish & Richardson P.C. 
225 Franklin Street 
Boston, MA 02110-2804 
Telephone: (617) 542-5070 
Facsimile: (617) 542-8906 


Deciphering the Message in Protein Sequences: 
Tolerance to Amino Acid Substitutions 

James U. Bowie, John F. Reidhaar-Olson, Wendell A. Lim, 
Robert T. Sauer 



An amino acid sequence encodes a message that determines 
the shape and function of a protein. This message is 
highly degenerate in that many different sequences can 
code for proteins with essentially the same structure and 
activity. Comparison of different sequences with similar 
messages can reveal key features of the code and improve 
understanding of how a protein folds and how it performs 
its function. 



THE GENOME IS MANIFEST LARGELY IN THE SET OF PROTEINS 
that it encodes. It is the ability of these proteins to fold 
into unique three-dimensional structures that allows them to 
function and carry out the instructions of the genome. Thus, 
comprehending the rules that relate amino acid sequence to structure 
is fundamental to an understanding of biological processes. 
Because an amino acid sequence contains all of the information 
necessary to determine the structure of a protein (1), it should be 
possible to predict structure from sequence, and subsequently to 
infer detailed aspects of function from the structure. However, both 
problems are extremely complex, and it seems unlikely that either 
will be solved in an exact manner in the near future. It may be 
possible to obtain approximate solutions by using experimental data 
to simplify the problem. In this article, we describe how an analysis 
of allowed amino acid substitutions in proteins can be used to 
reduce the complexity of sequences and reveal important aspects of 
structure and function. 



Methods for Studying Tolerance to 
Sequence Variation 

There are two main approaches to studying the tolerance of an 
amino acid sequence to change. The first method relies on the 
process of evolution, in which mutations are either accepted or 
rejected by natural selection. This method has been extremely 
powerful for proteins such as the globins or cytochromes, for which 
sequences from many different species are known (2-7). The second 
approach uses genetic methods to introduce amino acid changes at 
specific positions in a cloned gene and uses selections or screens to 
identify functional sequences. This approach has been used to great 
advantage for proteins that can be expressed in bacteria or yeast, 
where the appropriate genetic manipulations are possible (3, 8-11). 
The end results of both methods are lists of active sequences that can 
be compared and analyzed to identify sequence features that are 
essential for folding or function. If a particular property of a side 
chain, such as charge or size, is important at a given position, only 
side chains that have the required property will be allowed. Conversely, 
if the chemical identity of the side chain is unimportant, 
then many different substitutions will be permitted. 

Studies in which these methods were used have revealed that 
proteins are surprisingly tolerant of amino acid substitutions (2-4, 
11). For example, in studying the effects of approximately 1500 
single amino acid substitutions at 142 positions in lac repressor, 
Miller and co-workers found that about one-half of all substitutions 
were phenotypically silent (11). At some positions, many different, 
nonconservative substitutions were allowed. Such residue positions 
play little or no role in structure and function. At other positions, no 
substitutions or only conservative substitutions were allowed. These 
residues are the most important for lac repressor activity. 

What roles do invariant and conserved side chains play in 
proteins? Residues that are directly involved in protein functions 
such as binding or catalysis will certainly be among the most 
conserved. For example, replacing the Asp in the catalytic triad of 
trypsin with Asn results in a 10^4-fold reduction in activity (12). A 
similar loss of activity occurs in λ repressor when a DNA binding 
residue is changed from Asn to Asp (13). To carry out their 
function, however, these catalytic residues and binding residues 
must be precisely oriented in three dimensions. Consequently, 
mutations in residues that are required for structure formation or 
stability can also have dramatic effects on activity (10, 14-16). 
Hence, many of the residues that are conserved in sets of related 
sequences play structural roles. 



Substitutions at Surface and Buried Positions 

In their initial comparisons of the globin sequences, Perutz and 
co-workers found that most buried residues require nonpolar side 
chains, whereas few features of surface side chains are generally 
conserved (6). Similar results have been seen for a number of protein 
families (2, 4, 5, 7, 17, 18). An example of the sequence tolerance at 
surface versus buried sites can be seen in Fig. 1, which shows the 
allowed substitutions in λ repressor at residue positions that are near 
the dimer interface but distant from the DNA binding surface of the 
protein (9). These substitutions were identified by a functional 

SCIENCE, VOL. 247 



Fig. 1. (A) Amino acid substitutions allowed in a short region of λ repressor. The wild-type 
sequence is shown along the center line. The allowed substitutions shown above each position 
were identified by randomly mutating one to three codons at a time by using a cassette method 
and applying a functional selection (9). (B) The fractional solvent accessibility (42) of the 
wild-type side chain in the protein dimer (43) relative to the same atoms in an Ala-X-Ala model 
tripeptide. 
selection after cassette mutagenesis. A histogram of side chain 
solvent accessibility in the crystal structure of the dimer is also 
shown in Fig. 1. At six positions, only the wild-type residue or 
relatively conservative substitutions are allowed. Five of these 
positions are buried in the protein. In contrast, most of the highly 
exposed positions tolerate a wide range of chemically different side 
chains, including hydrophilic and hydrophobic residues. Hence, it 
seems that most of the structural information in this region of the 
protein is carried by the residues that are solvent inaccessible. 



Constraints on Core Sequences 

Because core residue positions appear to be extremely important 
for protein folding or stability, we must understand the factors that 
dictate whether a given core sequence will be acceptable. In general, 
only hydrophobic or neutral residues are tolerated at buried sites in 
proteins, undoubtedly because of the large favorable contribution of 
the hydrophobic effect to protein stability (19). For example, Fig. 2 
shows the results of genetic studies used to investigate the substitutions 
allowed at residue positions that form the hydrophobic core of 
the NH2-terminal domain of λ repressor (20). The acceptable core 
sequences are composed almost exclusively of Ala, Cys, Thr, Val, Ile, 
Leu, Met, and Phe. The acceptability of many different residues at 
each core position presumably reflects the fact that the hydrophobic 
effect, unlike hydrogen bonding, does not depend on specific 
residue pairings. Although it is possible to imagine a hypothetical 
core structure that is stabilized exclusively by residues forming 
hydrogen bonds and salt bridges, such a core would probably be 
difficult to construct because hydrogen bonds require pairing of 
donors and acceptors in an exact geometry. Thus the repertoire of 
possible structures that use a polar core would probably be extremely 
limited (21). Polar and charged residues are occasionally found in 
the cores of proteins, but only at positions where their hydrogen 
bonding needs can be satisfied (22). 

The cores of most proteins are quite closely packed (23), but some 
volume changes are acceptable. In λ repressor, the overall core 
volume of acceptable sequences can vary by about 10%. Changes at 
individual sites, however, can be considerably larger. For example, 
as shown in Fig. 2, both Phe and Ala are allowed at the same core 
position in the appropriate sequence contexts. Large volume 
changes at individual buried sites have also been observed in 
phylogenetic studies, where it has been noted that the size decreases 
and increases at interacting residues are not necessarily related in a 
simple complementary fashion (5, 7, 17). Rather, local volume 
changes are accommodated by conformational changes in nearby 
side chains and by a variety of backbone movements. 



The Informational Importance of the Core 

With occasional exceptions, the core must remain hydrophobic 
and maintain a reasonable packing density. However, since the core 
is composed of side chains that can assume only a limited number of 
conformations (24), efficient packing must be maintained without 
steric clashes. How important are hydrophobicity, volume, and 
steric complementarity in determining whether a given sequence can 
form an acceptable core? Each factor is essential in a physical sense, 
as a stable core is probably unable to tolerate unsatisfied hydrogen 
bonding groups, large holes, or steric overlaps (25). However, in an 
informational sense, these factors are not equivalent. For example, in 
experiments in which three core residues of λ repressor were 
mutated simultaneously, volume was a relatively unimportant informational 
constraint because three-quarters of all possible combinations 
of the 20 naturally occurring amino acids had volumes within 
the range tolerated in the core, and yet most of these sequences were 
unacceptable (20). In contrast, of the sequences that contained only 



Fig. 2. Amino acid substitutions allowed in the core of λ repressor. The wild-type side 
chains are shown pictorially in the approximate orientation seen in the crystal structure 
(43). The lists of allowed substitutions at each position are shown below the wild-type 
side chains. These substitutions were identified by randomly mutating one to four 
residues at a time by using a cassette method and applying a functional selection (20). 
Not all substitutions are allowed in every sequence background. 

[Fig. 2 graphic not reproduced: side chains labeled 40, 36, 47, 65, 51, 57 and 18, each with a column of allowed substitutions.] 



16 MARCH 1990 



ARTICLES 1307 



the appropriate hydrophobic residues, a significant fraction were 
acceptable. Hence, the hydrophobicity of a sequence contains 
more information about its potential acceptability in the core than 
does the total side chain volume. Steric compatibility was intermediate 
between volume and hydrophobicity in informational importance. 



The Informational Importance of Surface Sites 

We have noted that many surface sites can tolerate a wide variety 
of side chains, including hydrophilic and hydrophobic residues. This 
result might be taken to indicate that surface positions contain little 
structural information. However, Bashford et al., in an extensive 
analysis of globin sequences (4), found a strong bias against large 
hydrophobic residues at many surface positions. At one level, this 
may reflect constraints imposed by protein solubility, because large 
patches of hydrophobic surface residues would presumably lead to 
aggregation. At a more fundamental level, protein folding requires a 
partitioning between surface and buried positions. Consequently, to 
achieve a unique native state without significant competition from 
other conformations, it may be important that some sites have a 
decided preference for exterior rather than interior positions. As a 
result, many surface sites can accept hydrophobic residues individually, 
but the surface as a whole can probably tolerate only a 
moderate number of hydrophobic side chains. 



Identification of Residue Roles from 
Sets of Sequences 

Often, a protein of interest is a member of a family of related 
sequences. What can we infer from the pattern of allowed substitutions 
at positions in sets of aligned sequences generated by genetic 
or phylogenetic methods? Residue positions that can accept a 
number of different side chains, including charged and highly polar 
residues, are almost certain to be on the protein surface. Residue 
positions that remain hydrophobic, whether variable or not, are 
likely to be buried within the structure. In Fig. 3, those residue 
positions in λ repressor that can accept hydrophilic side chains are 
shown in orange and those that cannot accept hydrophilic side 
chains are shown in green. The obligate hydrophobic positions 
define the core of the structure, whereas positions that can accept 
hydrophilic side chains define the surface. 
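
The inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the residue classes HYDROPHILIC and HYDROPHOBIC are borrowed from the H3 and H1 groups of the Fig. 4 legend, and the function name classify_positions and the toy alignment are hypothetical rather than taken from the article. The sketch also ignores gaps and does not require "a number of" different side chains before calling a position surface.

```python
# Assumed residue classes (one-letter codes); an illustrative choice,
# not a definition given in this section.
HYDROPHILIC = set("DEKRNQ")   # charged and highly polar side chains
HYDROPHOBIC = set("WIFLMVC")  # strongly hydrophobic side chains


def classify_positions(aligned):
    """Label each column of a gap-free, equal-length alignment."""
    labels = []
    for column in zip(*aligned):
        observed = set(column)
        if observed & HYDROPHILIC:
            # At least one sequence tolerates a hydrophilic residue here.
            labels.append("surface")
        elif observed <= HYDROPHOBIC:
            # Every observed residue is hydrophobic, variable or not.
            labels.append("buried")
        else:
            labels.append("undetermined")
    return labels


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["MKLVG", "MELIS", "MRLVT"]
print(classify_positions(toy_alignment))
```

Positions labeled "undetermined" are those for which the set of sequences is uninformative under these simple rules; a larger set of sequences would be expected to resolve more of them.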

Functionally important residues should be conserved in sets of 
active sequences, but it is not possible to decide whether a side chain 
is functionally or structurally important just because it is invariant or 
conserved. To make this distinction requires an independent assay of 
protein folding. The ability of a mutant protein to maintain a stably 
folded structure can often be measured by biophysical techniques, 
by susceptibility to intracellular proteolysis (26), or by binding to 
antibodies specific for the native structure (27, 28). In the latter 
cases, it is possible to screen proteins in mutated clones for the 
ability to fold even if these proteins are inactive. Sets of sequences 
that allow formation of a stable structure can then be compared to 
the sets that allow both folding and function, with the active site or 
binding residues being those that are variable in the set of stable 
proteins but invariant in the set of functional proteins. The DNA- 
binding residues of Arc repressor were identified by this method (8). 
The receptor-binding residues of human growth hormone were also 
identified by comparing the stabilities and activities of a set of 
mutant sequences (28). However, in this case, the mutants were 
generated as hybrid sequences between growth hormone and related 
hormones with different binding specificities. 
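
The comparison of the two sequence sets can be expressed as a short Python sketch. The function name functional_positions and the example sequences are hypothetical, and the sketch assumes gap-free sequences of equal length; positions are reported with 0-based indices.

```python
def functional_positions(stable_set, functional_set):
    """Return positions that vary among stably folded sequences
    but are invariant among sequences that also function."""
    hits = []
    stable_columns = zip(*stable_set)
    functional_columns = zip(*functional_set)
    for index, (stable_col, func_col) in enumerate(
            zip(stable_columns, functional_columns)):
        varies_when_folded = len(set(stable_col)) > 1
        fixed_when_active = len(set(func_col)) == 1
        if varies_when_folded and fixed_when_active:
            hits.append(index)
    return hits


# Hypothetical data: sequences that fold, and the subset that is active.
stable = ["MKLA", "MRLG", "MELS"]
active = ["MKLA", "MKLG"]
print(functional_positions(stable, active))
```

A position that is invariant in both sets is not reported, because the data cannot distinguish a structural role from a functional one at that site.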



Implications for Structure Prediction 

At present, the only reliable method for predicting a low- 
resolution tertiary structure of a new protein is by identifying 
sequence similarity to a protein whose structure is already known 
(29, 30). However, it is often difficult to align sequences as the level 
of sequence similarity decreases, and it is sometimes impossible to 
detect statistically significant sequence similarity between distantly 
related proteins. Because the number of known sequences is far 
greater than the number of known structures, it would be advanta- 
geous to increase the reach of the available structural information by 
improving methods for detecting distant sequence relations and for 
subsequently aligning these sequences based on structural principles. 
In a normal homology search, the sequence database is scanned with 
a single test sequence, and every residue must be weighted equally. 
However, some residues are more important than others and should 
be weighted accordingly. Moreover, certain regions of the protein 
are more likely to contain gaps than others. Both kinds of informa- 
tion can be obtained from sequence sets, and several techniques have 




Fig. 3. Tolerance of positions in the NH2-terminal domain of λ repressor to 
hydrophilic side chains. The complex (43) of the repressor dimer (blue) and 
operator DNA (white) is shown. In (A), positions that can tolerate 
hydrophilic side chains are shown in orange. The same side chains are shown 
in (B) without the remaining protein atoms. In (C), positions that require 
hydrophobic or neutral side chains are shown in green. These side chains are 
shown in (D) without the remaining protein atoms. About three-fourths of 
the 92 side chains in the NH2-terminal domain are included in both (B) and 
(D). The remaining positions have not been tested. Data are from (9, 14, 20, 
27, 44). 
been used to combine such information into more appropriately 
weighted sequence searches and alignments (31). These methods 
were used to align the sequences of retroviral proteases with aspartic 
proteases, which in turn allowed construction of a three-dimension- 
al model for the protease of human immunodeficiency virus type 1 
(29). Comparison with the recently determined crystal structure of 
this protein revealed reasonable agreement in many areas of the 
predicted structure (32). 

The structural information at most surface sites is highly degener- 
ate. Except for functionally important residues, exterior positions 
seem to be important chiefly in maintaining a reasonably polar 
surface. The information contained in buried residues is also 
degenerate, the main requirement being that these residues remain 
hydrophobic. Thus, at its most basic level, the key structural 
message in an amino acid sequence may reside in its specific pattern 
of hydrophobic and hydrophilic residues. This is meant in an 
informational sense. Clearly, the precise structure and stability of a 
protein depends on a large number of detailed interactions. It is 
possible, however, that structural prediction at a more primitive 
level can be accomplished by concentrating on the most basic 
informational aspects of an amino acid sequence. For example, 
amphipathic patterns can be extracted from aligned sets of sequences 
and used, in some cases, to identify secondary structures. 

If a region of secondary structure is packed against the hydropho- 
bic core, a pattern of hydrophobic residues reflecting the periodicity 
of the secondary structure is expected (33, 34). These patterns can be 
obscured in individual sequences by hydrophobic residues on the 
protein surface. It is rare, however, for a surface position to remain 
hydrophobic over the course of evolution. Consequently, the am- 
phipathic patterns expected for simple secondary structures can be 
much clearer in a set of related sequences (6). This principle is 
illustrated in Fig. 4, which shows helical hydrophobic moment plots 
for the Antennapedia homeodomain sequence (Fig. 4A) and for a 
composite sequence derived from a set of homologous homeodo- 
main proteins (Fig. 4B) (35). The hydrophobic moment is a simple 
measure of the degree of amphipathic character of a sequence in a 
given secondary structure (34). The amphipathic character of the 
three α-helical regions in the Antennapedia protein (36) is clearly 
revealed only by the analysis of the combined set of homeodomain 
sequences. The secondary structure of Arc repressor, a small DNA- 
binding protein, was recently predicted by a similar method (8) and 
confirmed by nuclear magnetic resonance studies (37). 
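
A minimal Python sketch of a helical hydrophobic moment calculation follows. It uses the parameters stated in the Fig. 4 legend (an eight-residue window, vectors projected every 100°, and magnitudes of 1, 0, or -1 for groups H1, H2, and H3); the group memberships are as read from that legend, and the function names are hypothetical. It handles a single sequence only and does not reproduce the legend's procedure for choosing a representative residue from an aligned set.

```python
import math

# Residue groups as read from the Fig. 4 legend (one-letter codes).
H1 = set("WIFLMVC")  # high hydrophobicity   -> +1
H2 = set("YPATHGS")  # medium hydrophobicity ->  0
H3 = set("QNEDKR")   # low hydrophobicity    -> -1


def group_value(residue):
    """Map a residue to the vector magnitude for its group."""
    if residue in H1:
        return 1
    if residue in H2:
        return 0
    if residue in H3:
        return -1
    raise ValueError(f"unknown residue: {residue!r}")


def helical_moment(window, angle_deg=100.0):
    """Magnitude of the vector sum for one window of residues."""
    sum_x = sum_y = 0.0
    for k, residue in enumerate(window):
        theta = math.radians(k * angle_deg)
        value = group_value(residue)
        sum_x += value * math.cos(theta)
        sum_y += value * math.sin(theta)
    return math.hypot(sum_x, sum_y)


def moment_profile(sequence, width=8):
    """Moments for every full window along the sequence."""
    return [helical_moment(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


# Hypothetical peptides: an amphipathic pattern versus a uniform one.
print(helical_moment("LKKLLKKL"), helical_moment("LLLLLLLL"))
```

A uniformly hydrophobic window gives a small moment because its vectors largely cancel around the helix, whereas a window whose hydrophobic residues recur with helical periodicity gives a large one.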

The specific pattern of hydrophobic and hydrophilic residues in 
an amino acid sequence must limit the number of different structures 
a given sequence can adopt and may indeed define its overall fold. If 
this is true, then the arrangement of hydrophobic and hydrophilic 
residues should be a characteristic feature of a particular fold. Sweet 
and Eisenberg have shown that the correlation of the pattern of 
hydrophobicity between two protein sequences is a good criterion 
for their structural relatedness (38). In addition, several studies 
indicate that patterns of obligatory hydrophobic positions identified 
from aligned sequences are distinctive features of sequences that 
adopt the same structure (4, 29, 38, 39). Thus, the order of 
hydrophobic and hydrophilic residues in a sequence may actually be 
sufficient information to determine the basic folding pattern of a 
protein sequence. 
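
As a rough illustration of comparing hydrophobicity patterns, the Python sketch below computes a Pearson correlation between the hydropathy profiles of two equal-length, pre-aligned sequences. It is not the method of reference (38): the Kyte-Doolittle hydropathy scale is substituted here as an assumption, and the function names and example peptides are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (a substituted scale, used here
# only for illustration).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a profile with no variation has no correlation")
    return sxy / math.sqrt(sxx * syy)


def hydrophobicity_correlation(seq_a, seq_b):
    """Correlate the hydropathy profiles of two aligned sequences."""
    profile_a = [KYTE_DOOLITTLE[r] for r in seq_a]
    profile_b = [KYTE_DOOLITTLE[r] for r in seq_b]
    return pearson(profile_a, profile_b)


# Hypothetical peptides with no identical residues but the same pattern.
print(hydrophobicity_correlation("IRVE", "LKLD"))
```

The two example peptides share no residues, yet their profiles correlate strongly because the order of hydrophobic and hydrophilic positions is the same.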

Although the pattern of sequence hydrophobicity may be a 
characteristic feature of a particular fold, it is not yet clear how such 
patterns could be used for prediction of structure de novo. It is 
important to understand how patterns in sequence space can be 
related to structures in conformation space. Lau and Dill have 
approached this problem by studying the properties of simple 
sequences composed only of H (hydrophobic) and P (polar) groups 
on two-dimensional lattices (40). An example of such a 
representation is shown in Fig. 5. Residues adjacent in the sequence must 
occupy adjacent squares on the lattice, and two residues cannot 
occupy the same space. Free energies of particular conformations are 
evaluated with a single term, an attraction of H groups. By 
considering chains of ten residues, an exhaustive conformational 
search for all 1024 possible sequences of H and P residues was 
possible. For longer sequences only a representative fraction of the 
allowed sequence or conformation space could be explored. The 
significant results were as follows: (i) not all sequences can fold into 
a "native" structure and only a few sequences form a unique native 
structure; (ii) the probability that a sequence will adopt a unique 
native structure increases with chain length; and (iii) the native 
states are compact, contain a hydrophobic core surrounded by polar 
residues, and contain significant secondary structure. Although the 
gap between these two-dimensional simulations and three-dimen- 
sional structures is large, the use of simple rules and sequence 
representations yields results similar to those expected for real 
proteins. Three-dimensional lattice methods are also beginning to 
be developed and evaluated (41). 
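
The lattice rules described above can be made concrete with a small Python sketch. It is a simplified illustration, not the implementation of reference (40): the energy is assumed to be -1 for each pair of H residues that are lattice neighbors without being sequence neighbors, conformations related by rotation or mirror reflection are counted once, and the function names are hypothetical.

```python
MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Yield self-avoiding walks of n residues on a square lattice,
    one per class of rotations and mirror images."""
    if n == 1:
        yield ((0, 0),)
        return

    def extend(path, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            # Until the first turn, allow only "straight" or "up" so that
            # mirror-image walks are not generated twice.
            if not turned and (dx, dy) not in ((1, 0), (0, 1)):
                continue
            nxt = (x + dx, y + dy)
            if nxt in path:  # two residues cannot share a square
                continue
            path.append(nxt)
            yield from extend(path, turned or dy != 0)
            path.pop()

    # Fixing the first step removes the four rotations.
    yield from extend([(0, 0), (1, 0)], False)


def contact_energy(seq, coords):
    """Assumed energy: -1 per non-bonded H-H lattice contact."""
    index_at = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if seq[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # count each pair once
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and seq[j] == "H":
                energy -= 1
    return energy


def native_states(seq):
    """Return (lowest energy, number of conformations at that energy)."""
    best, count = None, 0
    for coords in conformations(len(seq)):
        e = contact_energy(seq, coords)
        if best is None or e < best:
            best, count = e, 1
        elif e == best:
            count += 1
    return best, count


print(native_states("HPPH"))
```

Under these assumptions, a sequence has a unique "native" structure when the returned count is 1. Looping over all sequences of H and P for a given chain length (for example with itertools.product) would give the fraction of sequences with a unique lowest-energy conformation.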



Summary 

There is more information in a set of related sequences than in a 
single sequence. A number of practical applications arise from an 
analysis of the tolerance of residue positions to change. First, such 
information permits the evaluation of a residue's importance to the 
function and stability of a protein. This ability to identify the 
essential elements of a protein sequence may improve our under- 
standing of the determinants of protein folding and stability as well 
as protein function. Second, patterns of tolerance to amino acid 
substitutions of varying hydrophilicity can help to identify residues 
likely to be buried in a protein structure and those likely to occupy 



[Plot x-axis: Center of window] 

Fig. 4. Helical hydrophobic moments calculated by using (A) the 
Antennapedia homeodomain sequence or (B) a set of 39 aligned 
homeodomain sequences (35). The bars indicate the extent of the helical 
regions identified in nuclear magnetic resonance studies of the 
Antennapedia homeodomain (36). To determine hydrophobic moments, 
residues were assigned to one of three groups: H1 (high hydrophobicity = 
Trp, Ile, Phe, Leu, Met, Val, or Cys); H2 (medium hydrophobicity = Tyr, 
Pro, Ala, Thr, His, Gly, or Ser); and H3 (low hydrophobicity = Gln, Asn, 
Glu, Asp, Lys, or Arg). For the aligned homeodomain sequences, the 
residues at each position were sorted by their hydrophobicity by using the 
scale of Fauchère and Pliska (45). Arg and Lys were not counted unless no 
other residue was found at the position, because they contain long 
aliphatic side chains and can thereby substitute for nonpolar residues at 
some buried sites. To account for possible sequence errors and rare 
exceptions, the most hydrophilic residue allowed at each position was 
discarded unless it was observed twice. The second most hydrophilic 
residue was then chosen to represent the hydrophobicity of each position. 
An eight-residue window was used and the vectors projected radially every 
100°. The vector magnitudes were assigned a value of 1, 0, or -1 for 
positions where the hydrophobicity group was H1, H2, or H3, respectively. 
Fig. 5. A representation of one compact conformation for a particular 
sequence of H and P residues on a two-dimensional square lattice. 
[Adapted from (40), with permission of the American Chemical Society] 



surface positions. The amphipathic patterns that emerge can be used 
to identify probable regions of secondary structure. Third, incorpo- 
rating a knowledge of allowed substitutions can improve the ability 
to detect and align distantly related proteins because the essential 
residues can be given prominence in the alignment scoring. 

As more sequences are determined, it becomes increasingly likely 
that a protein of interest is a member of a family of related 
sequences. If this is not the case, it is now possible to use genetic 
methods to generate lists of allowed amino acid substitutions. 
Consequently, at least in the short term, it may not be necessary to 
solve the folding problem for individual protein sequences. Instead, 
information from sequence sets could be used. Perhaps by simplify- 
ing sequence space through the identification of key residues, and by 
simplifying conformation space as in the lattice methods, it will be 
possible to develop algorithms to generate a limited number of trial 
structures. These trial structures could then, in turn, be evaluated by 
further experiments and more sophisticated energy calculations. 
Appendix B 

1. Homology between SEQ ID NO:2 of the invention and the amino acid sequen 



Score = 358 bits (920), Expect = 1e-97 

Identities = 184/378 (48%), Positives = 248/378 (66%), Gaps = 3/378 (0%) 

Query: SEQ ID NO:2 of the invention 
Sbjct: YAL060W of D1 

Query: 1 MXGLLYYGTND I RYSETVPEPE I KNPNDVK I KVSYC6 1 C8TDLXEFTYSGGPVFFPKQGT 60 

M+ L Y+ Dl ++ +P PEI+ ++V I VS+CGICG+DL E Y 6P+F PK Q 
Sbjct: 1 MRAUYFKKSDIHFTNDIPRPEIQTMEVIIDVSWMICQSDLHE— YLDGPIFMPKDGE 58 

Query: 61 KDKJSfiYELPLCPGHEFSGTWEVfiSGVTSVKPGDRVAVEATSHCSDRSRYKDTVAODLQ 120 

K*S UPU GHE SQ V +V6 VT VK 00 V V+A S C+D * + + 
Sbjct: 59 CHKLSNAALPU*GHE«GIVSKVGPKVTKVKVGDHYVVDM55CADLHCWPHSKFYNSK 118 

Query: 121 LCHACQSGSPNCCASLSFCGLGGASG6FAEYVVYQEDW|VKLPDSI pJ)B] GALVEP I SVA 180 

C ACQ <i$ N C F QU» SGSFAE W + K+<- +P IP ALVEP+SV 
Sbjct: 119 PCDACQRGSENLCTHAGFVGLGVI SGGFAEQWVSQHH I ( PVPtfEJ PLDVAALVEPUSVT 178 

Query: 181 HHAVERARFQPGQTALVIGGGP I GLAT \ LALQGHHAGK fVCSEPAL I RRQFAKELGAEVF 240 

WHAV+ *■ F+ Q +ALVL6 GPIGL TIL L*G A XIV SE A R + AK+LG EVF 
Sbjct: 179 WHAVK I SGFKKGSS ALVLGAOP I GLCT I LVLKGM6ASK I W5E 1 AERR I EBAKKL6VE VF 238 

Query: 241 DPSTCDQW-AVUWVPENEaFKAAFOCSQVW 299 

+PS + ^ -h+fiF +KKSG+ TF TS+ A G A (WWWG P* 
Sbjct: 239 NP5KHGHKS I E I LRGLTKSHDGFDY5YDCSG ( QVTFETSLK ALTFK8TATN I AVWQPKPV 298 

Query: 300 GFMrWSLTYQEKYATGSMCYTVKDFQEWKALEDGL I SLDKARKK I TGKVHLKDGVEKQF 359 

F PM +T QEK TfiS+ Y V+ F+EW>A+ +6 l« -m-MTGK <-HJG EKGF 
Sbjct: 299 PFQPHOYTLQEKVMTGS ISYWEAFEEWRA I HNGD 1 AMEOCKQL J TGKQR I ED3WEKGF 358 



Query: 360 KQL I EHKENNVK I LVTPN 377 

- ++L-W-HKE+NVX I L+TPN 
Sbjct: 359 QELMDHKESNVK I LLTPN 376 



Appendix B (con't) 



2. Homology between SEQ ID NO:1 of the invention and the SEQ ID NO:1 of D3 
Query: SEQ ID NO:1 of the invention (1143 letters) 
Sbjct: SEQ ID NO:1 of D3 (Length = 1168) 

Score = 24 bits (12), Expect = 0.064 
Identities = 12/12 (100%) 

Strand = Plus / Plus 

Query: 718 ttcgatccttct 729 

||||||||||||
Sbjct: 712 ttcgatccttct 723 



Score = 24 bits (12), Expect = 0.064 
Identities = 12/12 (100%) 
Strand = Plus / Plus 

Query: 641 t«o*teetgtt 552 

||||||||||||

Sbjct: 535 tfttattfctftt 546 



Score = 22 bits (11), Expect = 0.26 
Identities = 11/11 (100%) 
Strand = Plus / Plus 

Query: 479 toaagetgoca 469 

|||||||||||
Sbjct: 884 tcaagctgoca 604 



Score = 22 bits (11), Expect = 0.25 
Identities = 11/11 (100%) 
Strand = Plus / Plus 

Query: 892 ccaatteeatt 902 

|||||||||||

Sbjct: 598 ccaattuatt 608 



Appendix B (con't) 



3. Homology between SEQ ID NO:2 of the invention and the SEQ ID NO:2 of D3 
Query: SEQ ID NO:2 of the invention (380 letters) 
Sbjct: SEQ ID NO:2 of D3 (Length = 400) 

Score = 396 bits (1018), Expect = e-115 

Identities = 201/378 (53%), Positives = 254/378 (67%), Gaps = 3/378 (0%) 

Query: 1 MKGLLYYGTND I RYSETVPEPE I KNPNOVK I KVSYCG I CGTDLKEFTYSGGPVFFPKQGT 60 

•H- L Y+Q DIRY++ + EP |+ + -H-l+VS-tCGICG+DL E Y GP+FFP+ Q 
Sbjct: 1 K8ALAYF0KQDIRYTKBLEEPVI ETODG I E I EVSWCG I CGSDLHE— YLDGPI FFPEDGK 58 

Query: 61 KDKISGmPL(miEFXXXXXXXX)«^^ 120 

+SG LP GHE K GO V VEAT CO * f 

Sbjct: 59 VHOVSfiLCLPQAIKfflEMSQ I VSKVGPKVTN I KAGDHWVEATGTCUJWTWPNAAHAKDA 118 

Query: 121 LCHACQSBSPNCCASLSFCQLG^GOFAEYVVYGE&HMVKLPDS I PDDI GALVEPI SVA 180 

C ACQ G NCCA I F GLG SQ9FAE W E H+VK+P-H-H* D+ ALVEPISV> 
Sbjct: 119 ECAACQRGf^CCAHLQFKGLQVHSQflFAEKVYVSEKHWKlPNTLPLDVAALVEPlSVS 178 

Query: 181 WHAVERARFQPGQTALVlGSGP I OLAT I LALQGHHAGK I VCSEPAL \ RRQFAKELGAEVF 240 

WHAV ++ Q GQ+ALVLG GPlGLATtLALQOH A KtV SEPA IRR A +LG E F 
Sbjct: 179 VIHAVR I SKLOKGOSALVLGAGP \ GLAT I LAL06HGASK I WSEPAE I RRNQAAKL9VETP 238 

Query: 241 DPST^DDANAVLKAMVPEMEflFHAAFDCSGVPOTfTTS I VATGPSG I AVNVAVWGDHP I 298 

DPS +0A +LK + P EOF A+OCSGV TP T + AT 0+ VN+A+WG PI 
Sbjct: 239 DPSEWEDAVNILKKLAPGGEGFDFAYDOSGVKPTFBTGVHATTFRGIIIYVNIAIWGHKPI 298 

Query: 300 GFlffilSi.TYQ£KYATGSMCYTVltl)FQEWKALEDGH SLPKARKM I TGKVHLKDGVEKGF 359 

F P|l +T QEK+ TGSMCYT+KDF-HVV+AL 40 J++OKAR +IT6+ ++0G KGF 
Sbjct: 299 DFKPKOVTLQEKfVTGSWCYT IKDFEDVVQALCNGSI AHJKARhL! TGRQKI EDGFTKGF 358 



Query: 360 KOL I EHXENNVK I LVTPN 377 

«.+ HKE N+KIL+TPN 
Sbjct: 359 DELMNHKEKN I K ) LLTPN 376 



